Large Scale Lexical Analysis
Authors
Abstract
The following paper presents a lexical analysis component as implemented in the PANACEA project. The goal is to automatically extract lexicon entries from crawled corpora, using corpus-based methods for high-quality linguistic text processing and focusing on data quality without neglecting quantitative aspects. The task of lexical analysis is to assign linguistic information (such as part of speech, inflectional class, gender, subcategorisation frame, and semantic properties) to all parts of the input text. If tokens are ambiguous, lexical analysis must provide all possible annotation sets for later (syntactic) disambiguation, whether by tagging or by full parsing. The paper presents an approach for assigning part-of-speech tags for German and English to large input corpora (more than 50 million tokens), providing a workflow that takes crawled corpora as input and delivers POS-tagged lemmata ready for lexicon integration. The tools include sentence splitting, lexicon lookup, decomposition, and POS defaulting. Evaluation shows that the overall error rate can be brought down to about 2% if the language resources are properly designed. The complete workflow is implemented as a sequence of web services integrated into the PANACEA platform.
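As an illustration of the lookup cascade sketched above, the following Python fragment is a minimal sketch, not the PANACEA implementation: the lexicon format, the decomposition heuristic, and the default rules are assumptions made here for illustration. It shows how a token can first be looked up in a lexicon, then decomposed, and finally given a default POS tag, while preserving ambiguity for later disambiguation.

from typing import Dict, Set

def analyse_token(token: str, lexicon: Dict[str, Set[str]]) -> Set[str]:
    """Return all candidate POS tags for a token; ambiguity is kept
    for later (syntactic) disambiguation."""
    # 1. Direct lexicon lookup.
    if token.lower() in lexicon:
        return lexicon[token.lower()]
    # 2. Decomposition: try to split off a known head,
    #    relevant mainly for German compounds.
    for i in range(1, len(token) - 2):
        head = token[i:].lower()
        if head in lexicon:
            return lexicon[head]
    # 3. POS defaulting from simple surface cues (assumed heuristics).
    if token[0].isupper():
        return {"NN"}
    if token.endswith(("ing", "ed")):
        return {"VB"}
    return {"NN"}

if __name__ == "__main__":
    lexicon = {"analyse": {"NN", "VB"}, "runs": {"VBZ", "NNS"}}
    for tok in ["Analyse", "runs", "Datenanalyse", "blorked"]:
        print(tok, sorted(analyse_token(tok, lexicon)))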
Similar resources
Lexical Bundles in English Abstracts of Research Articles Written by Iranian Scholars: Examples from Humanities
This paper investigates a special type of recurrent expressions, lexical bundles, defined as a sequence of three or more words that co-occur frequently in a particular register (Biber et al., 1999). Considering the importance of this group of multi-word sequences in academic prose, this study explores the forms and syntactic structures of three- and four-word bundles in English abstracts writte...
Assessing the feasibility of large-scale natural language processing in a corpus of ordinary medical records: a lexical analysis
OBJECTIVE Identify the lexical content of a large corpus of ordinary medical records to assess the feasibility of large-scale natural language processing. METHODS A corpus of 560 megabytes of medical record text from an academic medical center was broken into individual words and compared with the words in six medical vocabularies, a common word list, and a database of patient names. Unrecogn...
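For illustration only, the sketch below (Python; the tokenisation rule and the vocabularies are assumptions, not the study's actual resources) computes the kind of coverage figure such a lexical analysis yields: the fraction of word tokens found in at least one vocabulary.

import re
from typing import List, Set

def coverage(text: str, vocabularies: List[Set[str]]) -> float:
    """Fraction of word tokens found in at least one vocabulary."""
    words = re.findall(r"[a-z']+", text.lower())
    if not words:
        return 0.0
    known = sum(1 for w in words if any(w in v for v in vocabularies))
    return known / len(words)

if __name__ == "__main__":
    medical_vocab = {"patient", "fracture", "mg"}
    common_words = {"the", "was", "given", "and", "healed"}
    sample = "The patient was given 20 mg and the frakture healed."
    print(f"coverage: {coverage(sample, [medical_vocab, common_words]):.0%}")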
CLCL-A Clustering Algorithm Based on Lexical Chain for Large-Scale Documents
Along with the explosion of information, how to cluster large-scale documents has become more and more important. This paper proposes a novel document clustering algorithm (CLCL) to solve this problem. The algorithm first constructs lexical chains from the feature space to reflect the different topics the input documents contain; documents can then be separated into clusters by these lexical chains....
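As a hedged illustration of the general idea only, and not the CLCL algorithm itself, the sketch below assigns each document to whichever lexical chain, here just a hand-picked set of topically related words, it overlaps with most; the chains and the overlap measure are assumptions.

from typing import Dict, List, Set

def assign_to_chains(docs: List[str], chains: Dict[str, Set[str]]) -> Dict[str, List[str]]:
    """Group documents by the lexical chain with the largest word overlap."""
    clusters: Dict[str, List[str]] = {name: [] for name in chains}
    for doc in docs:
        words = set(doc.lower().split())
        best = max(chains, key=lambda name: len(words & chains[name]))
        clusters[best].append(doc)
    return clusters

if __name__ == "__main__":
    chains = {"finance": {"bank", "stock", "market"},
              "medicine": {"patient", "dose", "clinic"}}
    docs = ["the stock market fell", "the patient received a new dose"]
    print(assign_to_chains(docs, chains))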
BabelDomains: Large-Scale Domain Labeling of Lexical Resources
In this paper we present BabelDomains, a unified resource which provides lexical items with information about domains of knowledge. We propose an automatic method that uses knowledge from various lexical resources, exploiting both distributional and graph-based clues, to accurately propagate domain information. We evaluate our methodology intrinsically on two lexical resources (WordNet and Babe...
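The sketch below illustrates graph-based label propagation in the general spirit of what the abstract describes, but it is not the BabelDomains method: unlabeled lexical entries simply take the majority domain label of their already-labeled neighbours, and the graph, the seed labels, and the single-pass update are all assumptions.

from collections import Counter
from typing import Dict, List

def propagate(graph: Dict[str, List[str]], seeds: Dict[str, str]) -> Dict[str, str]:
    """Single pass of majority-vote domain propagation over a lexical graph."""
    labels = dict(seeds)
    for node, neighbours in graph.items():
        if node in labels:
            continue
        votes = Counter(labels[n] for n in neighbours if n in labels)
        if votes:
            labels[node] = votes.most_common(1)[0][0]
    return labels

if __name__ == "__main__":
    graph = {"guitar": ["music", "chord"], "tennis": ["sport"],
             "music": [], "chord": [], "sport": []}
    seeds = {"music": "Music", "chord": "Music", "sport": "Sport"}
    print(propagate(graph, seeds))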
Cross-lingual neighborhood effects in generalized lexical decision and natural reading.
The present study assessed intra- and cross-lingual neighborhood effects, using both a generalized lexical decision task and an analysis of a large-scale bilingual eye-tracking corpus (Cop, Dirix, Drieghe, & Duyck, 2016). Using new neighborhood density and frequency measures, the general lexical decision task yielded an inhibitory cross-lingual neighborhood density effect on reading times of se...
Publication date: 2012